Stochastic Dimensionality Reduction for K-means Clustering



Christos Boutsidis
Mathematical Sciences Department
IBM T. J. Watson Research Center
cboutsi@us.ibm.com

Anastasios Zouzias
Computer Science Department
University of Toronto
zouzias@cs.toronto.edu

Michael W. Mahoney
Department of Mathematics
Stanford
mmahoney@cs.stanford.edu

Petros Drineas
Computer Science Department
Rensselaer Polytechnic Institute
drinep@cs.rpi.edu



Abstract 

We study the topic of dimensionality reduction methods for k-means clustering. Dimensionality reduction encompasses the union of two approaches: feature selection and feature extraction. First, feature selection selects a small subset of actual features from the data and then runs the clustering algorithm only on the selected features. Second, feature extraction constructs a small set of new artificial features and then runs the clustering algorithm only on the constructed features. Despite the significance of the problem, as well as the wealth of heuristic methods addressing it, there exist no provably accurate feature selection methods. On the other hand, two provably accurate feature extraction methods for k-means exist: the first one is randomized and is based on Random Projections; the other is deterministic and is based on the Singular Value Decomposition.

This paper addresses this shortcoming by presenting the first provably accurate feature selection method for k-means clustering. We also present two novel feature extraction methods: the first one is based on Random Projections and improves the existing result in terms of speed and the number of features that need to be extracted; the other is based on fast approximate SVD factorizations and improves the existing result in terms of speed. All three methods of our work are randomized and, with constant probability, provide constant-factor approximation guarantees with respect to the optimal k-means objective value.

1 Introduction 

Clustering is ubiquitous in science and engineering, with numerous and diverse application domains, ranging from bio-informatics and medicine to the social sciences and the web [14]. Perhaps the most well-known clustering algorithm is the so-called "k-means" algorithm or Lloyd's method [19], an iterative expectation-maximization type approach, which attempts to address the following objective: given a set of points in Euclidean space and the number of clusters k, split the points into k clusters so that the total sum of the (squared Euclidean) distances of each point to its nearest cluster center is minimized (see below for the formal mathematical formulation of this statement, a.k.a. the k-means clustering problem). The effectiveness of Lloyd's method [22] has made k-means enormously popular in applications [25].

In recent years, the high dimensionality of modern massive datasets has posed a considerable challenge to k-means clustering approaches. First, the curse of dimensionality makes algorithms for k-means clustering very slow, and, second, the existence of many irrelevant features may not allow the identification of the relevant underlying structure in the data [11]. Practitioners addressed such obstacles by introducing feature selection and feature extraction techniques. Feature selection selects a small subset of actual features from the data and runs the clustering algorithm only on the selected features, whereas feature extraction constructs a small set of artificial features and runs the clustering algorithm on the constructed features. Below, we describe the mathematical framework within which we will study such dimensionality reduction methods.

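Consider $m$ points $\mathcal{P} = \{p_1, p_2, \ldots, p_m\} \subseteq \mathbb{R}^n$ and an integer $k$ denoting the number of clusters. The objective of k-means is to find a $k$-partition of $\mathcal{P}$ such that points that are "close" to each other belong to the same cluster and points that are "far" from each other belong to different clusters. A $k$-partition of $\mathcal{P}$ is a collection $\mathcal{S} = \{S_1, S_2, \ldots, S_k\}$ of $k$ non-empty pairwise disjoint sets which covers $\mathcal{P}$. Let $s_j = |S_j|$ be the size of $S_j$. For each set $S_j$, let $\mu_j \in \mathbb{R}^n$ be its centroid: $\mu_j = \left(\sum_{p_i \in S_j} p_i\right)/s_j$. The k-means objective function is

$$\mathcal{F}(\mathcal{P}, \mathcal{S}) = \sum_{i=1}^{m} \|p_i - \mu(p_i)\|_2^2,$$

where $\mu(p_i) \in \mathbb{R}^n$ is the centroid of the cluster to which $p_i$ belongs. The goal of k-means is to find the optimal $k$-partition of the points in $\mathcal{P}$,

$$\mathcal{S}_{opt} = \arg\min_{\mathcal{S}} \mathcal{F}(\mathcal{P}, \mathcal{S}).$$

The goal of dimensionality reduction is to construct points $\tilde{\mathcal{P}} = \{\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_m\} \subseteq \mathbb{R}^r$ (for some $r \ll n$ specified in advance). Feature selection constructs the $\tilde{p}_i$'s by selecting actual features of the corresponding $p_i$'s, while feature extraction does so by constructing new artificial features. Consider the optimum k-means partition of the points in $\tilde{\mathcal{P}}$,

$$\tilde{\mathcal{S}}_{opt} = \arg\min_{\mathcal{S}} \mathcal{F}(\tilde{\mathcal{P}}, \mathcal{S}).$$

The goal of a dimensionality reduction algorithm for k-means clustering is to construct the new set $\tilde{\mathcal{P}}$ such that

$$\mathcal{F}(\mathcal{P}, \tilde{\mathcal{S}}_{opt}) \le (\beta + \epsilon) \cdot \mathcal{F}(\mathcal{P}, \mathcal{S}_{opt}).$$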

(Here, $\beta$ is a small constant, for example $\beta = 1, 2, 3$. The parameter $\epsilon > 0$ is given as input and one minimizes $r$, which depends on $\epsilon$, to achieve the desired accuracy $\beta + \epsilon$.) In words, computing an optimal partition $\tilde{\mathcal{S}}_{opt}$ on the projected low-dimensional data and plugging it back to cluster the high dimensional data gives a constant factor approximation to the optimal clustering. Notice that we measure approximability by evaluating the k-means objective function, an approach which is not new [21, 3] (see also the references discussed in Section 2.1). Comparing $\tilde{\mathcal{S}}_{opt}$ to $\mathcal{S}_{opt}$ directly would be much more interesting but at the same time a much harder (combinatorial) problem.

1.1 Prior Work 

Despite the significance of dimensionality reduction in the context of clustering, as well as the wealth of heuristic methods addressing it [10], there exist no provably accurate feature selection methods for k-means clustering. On the other hand, two provably accurate feature extraction methods exist, which we briefly describe next.

First, a folklore result by [16] indicates that one can construct $r = O(\log(m)/\epsilon^2)$ artificial features with Random Projections and, with high probability, obtain a $(1+\epsilon)$-approximate clustering. (In the parlance of our framework, $\beta = 1$.) We discuss Random Projections in Section 2.3. The algorithm implied by [16] is as follows: let $A \in \mathbb{R}^{m \times n}$ contain the points $\mathcal{P} = \{p_1, p_2, \ldots, p_m\} \subseteq \mathbb{R}^n$ as its rows; then, multiply $A$ from the right with a random projection matrix $R \in \mathbb{R}^{n \times r}$ to construct $C = AR \in \mathbb{R}^{m \times r}$ containing the points $\tilde{\mathcal{P}} = \{\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_m\} \subseteq \mathbb{R}^r$ as its rows. The proof that this approach gives a $(1+\epsilon)$-approximate clustering is immediate: [16] proved that all the pairwise Euclidean distances of the points of $\mathcal{P}$ are preserved within a multiplicative factor $1 \pm \epsilon$, and, since the k-means objective of any partition can be written in terms of pairwise distances between points of the same cluster, so is any - hence the optimal - value of the k-means objective function.
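The step from pairwise distances to the objective rests on a standard identity, written here for a single cluster $S_j$ of size $s_j$ with centroid $\mu_j$. The identity is a well-known fact and is not quoted from [16]; it is included only as a worked equation.

```latex
\sum_{p_i \in S_j} \|p_i - \mu_j\|_2^2
  \;=\; \frac{1}{2 s_j} \sum_{p_i \in S_j} \sum_{p_l \in S_j} \|p_i - p_l\|_2^2 .
```

Summing over $j = 1, \ldots, k$ expresses the objective of any partition through within-cluster pairwise distances, so a map that preserves every pairwise distance up to a factor $1 \pm \epsilon$ changes every squared term, and therefore the whole objective, by a factor between $(1-\epsilon)^2$ and $(1+\epsilon)^2$.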

Second, [7] argue that one can construct $r = k$ artificial features using the SVD, in $O(mn \min\{m, n\})$ time, to obtain a 2-approximation on the clustering quality. The algorithm of [7] is as follows: given $A \in \mathbb{R}^{m \times n}$ containing the points of $\mathcal{P}$ and $k$, construct $C = AV_k \in \mathbb{R}^{m \times k}$. Here, $V_k \in \mathbb{R}^{n \times k}$ contains the top $k$ right singular vectors of $A$. The proof of this result is briefly discussed in Section 5.

Finally, we should note that a discussion of existing dimensionality reduction methods with no 
theoretically provable performance is beyond the scope of the present paper. 



Reference     Description    Dimensions                    Time = O(x), x =                                    Error
----------    -----------    --------------------------    ------------------------------------------------    --------------
Folklore      RP             $O(\log(m)/\epsilon^2)$       $mn\lceil \epsilon^{-2}\log(m)/\log(n) \rceil$      $1 + \epsilon$
[7]           Exact SVD      $k$                           $mn\min\{m, n\}$                                    $2$
Theorem 11    RS             $O(k\log(k)/\epsilon^2)$      $mnk/\epsilon + t_0$                                $3 + \epsilon$
Theorem 12    RP             $O(k/\epsilon^2)$             $mn\lceil \epsilon^{-2}k/\log(n) \rceil$            $2 + \epsilon$
Theorem 13    Approx. SVD    $k$                           $mnk/\epsilon$                                      $2 + \epsilon$

Table 1: Provably accurate dimensionality reduction methods for k-means clustering. RP stands for Random Projections, and RS for Random Sampling. The technique in the second row of the table is deterministic; the others fail with, say, a constant probability. In the RP methods, the construction is done with random sign matrices and the mailman algorithm for matrix multiplication (see Sections 2.3 and 4, respectively). In Theorems 11, 12 and 13 we assume $\gamma = 1$. Finally, in the third row of the table, $t_0 = k\log(k)\epsilon^{-2}\log(k\log(k)\epsilon^{-1})$.



1.2 Summary of our Contributions 

We present the first provably accurate feature selection algorithm for k-means: Theorem 11 presents an $O(mnk\epsilon^{-1} + k\log(k)\epsilon^{-2}\log(k\log(k)\epsilon^{-1}))$ time randomized algorithm that, with constant probability, achieves a $(3+\epsilon)$-error with $r = O(k\log(k)/\epsilon^2)$ features. Given $A$ and $k$, the algorithm of this theorem computes $Z \in \mathbb{R}^{n \times k}$, which approximates $V_k \in \mathbb{R}^{n \times k}$, the matrix containing the top $k$ right singular vectors of $A$ (see footnote 1). Then, the selection of the features (columns of $A$) is done with a standard randomized sampling approach with replacement, with probabilities that are computed from the matrix $Z$. The proof of Theorem 11 is a synthesis of ideas from [7] and [23].

^1 [5] presented an unsupervised feature selection algorithm by working with the matrix $V_k$; in this work, we show that the same approximation bound can be achieved by working with a matrix that approximates $V_k$ in the sense of low rank matrix approximations (see Lemma 2).

Moreover, we describe a random-projection-type feature extraction algorithm: Theorem 12 presents an $O(mn\lceil \epsilon^{-2}k/\log(n) \rceil)$ time algorithm that, with constant probability, achieves a $(2+\epsilon)$-error with $r = O(k/\epsilon^2)$ artificial features. We improve the above folklore result by showing that a smaller number of features is enough to obtain an approximate clustering. The algorithm of Theorem 12 is the same as the one in the standard result for random projections that we outlined in the prior work section, but uses only $r = O(k/\epsilon^2)$ dimensions for the random projection matrix, which breaks the $O(\log(m)/\epsilon^2)$ bound of the fundamental result of [16]. Our proof relies on ideas from [7] as well as [24].

Finally, Theorem 13 describes a feature extraction algorithm that employs approximate SVD decompositions and constructs $r = k$ artificial features in $O(mnk/\epsilon)$ time such that, with constant probability, the clustering error is at most a $2 + \epsilon$ factor from the optimal. We improve the existing SVD dimensionality reduction method by showing that fast approximate SVD gives features that can do almost as well as the features from the exact, though expensive, SVD. Our algorithm and proof are similar to those in [7], but we show that one only needs to compute an approximate SVD of $A$. We summarize previous results as well as ours in Table 1.

2 Preliminaries 

Basic Notation. We use $A, B, \ldots$ to denote matrices; $a, p, \ldots$ to denote column vectors. $I_n$ is the $n \times n$ identity matrix; $0_{m \times n}$ is the $m \times n$ matrix of zeros; $A_{(i)}$ is the $i$-th row of $A$; $A^{(j)}$ is the $j$-th column of $A$; and $A_{ij}$ denotes the $(i, j)$-th element of $A$. We use $\mathbb{E}\,Y$ to take the expectation of a random variable $Y$ and $\mathbb{P}(\mathcal{E})$ to take the probability of a probabilistic event $\mathcal{E}$. We abbreviate "independent identically distributed" to "i.i.d." and "with probability" to "w.p.".

Matrix norms. We use the Frobenius and the spectral matrix norms: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ and $\|A\|_2 = \max_{x : \|x\|_2 = 1} \|Ax\|_2$, respectively (for a matrix $A$). For any $A, B$: $\|A\|_2 \le \|A\|_F$, $\|AB\|_F \le \|A\|_F\|B\|_2$, and $\|AB\|_F \le \|A\|_2\|B\|_F$. The latter two properties are stronger versions of the standard submultiplicativity property: $\|AB\|_F \le \|A\|_F\|B\|_F$. We will refer to these versions as spectral submultiplicativity. Finally, the triangle inequality of matrix norms indicates that $\|A + B\|_F \le \|A\|_F + \|B\|_F$.

Lemma 1 (Matrix Pythagorean Theorem). Let $X, Y \in \mathbb{R}^{m \times n}$ satisfy $XY^T = 0_{m \times m}$. Then,

$$\|X + Y\|_F^2 = \|X\|_F^2 + \|Y\|_F^2.$$

Proof. Since $XY^T = 0_{m \times m}$ (and hence $YX^T = 0_{m \times m}$),

$$\|X + Y\|_F^2 = \mathrm{Tr}\left((X + Y)(X + Y)^T\right) = \mathrm{Tr}\left(XX^T + XY^T + YX^T + YY^T\right) = \|X\|_F^2 + \|Y\|_F^2. \qquad \blacksquare$$

This matrix form of the Pythagorean theorem is the starting point for the proofs of the three main theorems presented in this work. The idea to use Matrix Pythagoras to analyze a dimensionality reduction method for k-means was initially introduced in [7] and it turns out to be very useful to prove our results as well.
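As a quick numerical sanity check of the Matrix Pythagorean Theorem, the following NumPy sketch builds two matrices with mutually orthogonal row spaces and compares both sides of the identity. The construction of `X` and `Y` from an orthonormal basis `Q` is an assumption of this example, chosen only to guarantee $XY^T = 0$; it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 10

# Orthonormal basis of R^n; its first 4 and last 6 columns span orthogonal subspaces.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = rng.standard_normal((m, 4)) @ Q[:, :4].T   # rows of X lie in span(Q[:, :4])
Y = rng.standard_normal((m, 6)) @ Q[:, 4:].T   # rows of Y lie in span(Q[:, 4:])

# Both sides of ||X + Y||_F^2 = ||X||_F^2 + ||Y||_F^2.
lhs = np.linalg.norm(X + Y, "fro") ** 2
rhs = np.linalg.norm(X, "fro") ** 2 + np.linalg.norm(Y, "fro") ** 2
```

The two quantities `lhs` and `rhs` agree up to floating-point error because the cross terms $XY^T$ and $YX^T$ vanish.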



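Singular Value Decomposition. The SVD of $A \in \mathbb{R}^{m \times n}$ of rank $\rho \le \min\{m, n\}$ is

$$A = \begin{pmatrix} U_k & U_{\rho-k} \end{pmatrix} \begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{\rho-k} \end{pmatrix} \begin{pmatrix} V_k^T \\ V_{\rho-k}^T \end{pmatrix},$$

with singular values $\sigma_1 \ge \ldots \ge \sigma_k \ge \sigma_{k+1} \ge \ldots \ge \sigma_\rho > 0$. We will use $\sigma_i(A)$ to denote the $i$-th singular value of $A$ when the matrix is not clear from the context. The matrices $U_k \in \mathbb{R}^{m \times k}$ and $U_{\rho-k} \in \mathbb{R}^{m \times (\rho-k)}$ contain the left singular vectors of $A$; similarly, the matrices $V_k \in \mathbb{R}^{n \times k}$ and $V_{\rho-k} \in \mathbb{R}^{n \times (\rho-k)}$ contain the right singular vectors. $\Sigma_k \in \mathbb{R}^{k \times k}$ and $\Sigma_{\rho-k} \in \mathbb{R}^{(\rho-k) \times (\rho-k)}$ contain the singular values of $A$. It is well-known that $A_k = U_k\Sigma_kV_k^T$ minimizes $\|A - X\|_F$ over all matrices $X \in \mathbb{R}^{m \times n}$ of rank at most $k \le \rho$. We use $A_{\rho-k} = A - A_k = U_{\rho-k}\Sigma_{\rho-k}V_{\rho-k}^T$.

Also, $\|A\|_F = \sqrt{\sum_{i=1}^{\rho} \sigma_i^2(A)}$ and $\|A\|_2 = \sigma_1(A)$. The best rank $k$ approximation to $A$ satisfies:

$$\|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{\rho} \sigma_i^2(A)}.$$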

Approximate Singular Value Decomposition. The exact SVD of $A$ takes cubic time. In this work, to speed up certain algorithms, we will use fast approximate SVD. We quote a recent result from [3], but similar relative-error Frobenius norm SVD approximations can be found elsewhere; see, for example, [53]. The exact description of the algorithm of the following lemma is out of the scope of the present work.

Lemma 2. Given $A \in \mathbb{R}^{m \times n}$ of rank $\rho$, a target rank $2 \le k < \rho$, and $0 < \epsilon < 1$, there exists a randomized algorithm that computes a matrix $Z \in \mathbb{R}^{n \times k}$ such that $Z^TZ = I_k$, $EZ = 0_{m \times k}$ (for $E = A - AZZ^T \in \mathbb{R}^{m \times n}$), and

$$\mathbb{E}\,\|E\|_F^2 \le (1 + \epsilon)\|A - A_k\|_F^2.$$

The proposed algorithm runs in $O(mnk/\epsilon)$ time. We use $Z$ = FastFrobeniusSVD($A$, $k$, $\epsilon$) to denote this randomized procedure.
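For illustration only, the following NumPy sketch shows one generic way to obtain a matrix $Z$ with the structural properties listed in Lemma 2 (orthonormal columns and $EZ = 0$). It is a standard randomized range-finder and not the algorithm of [3]; it makes no claim about the $(1+\epsilon)$ guarantee or the $O(mnk/\epsilon)$ running time. The function name `approx_right_singular_vectors` and the `oversample` parameter are assumptions introduced here.

```python
import numpy as np

def approx_right_singular_vectors(A, k, oversample=10, seed=0):
    """Return an n x k matrix Z with orthonormal columns approximating V_k.

    Generic randomized sketch (an assumption of this example), not the
    FastFrobeniusSVD procedure of Lemma 2.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(m, n, k + oversample)
    # Sketch the row space of A with a Gaussian test matrix.
    G = rng.standard_normal((ell, m))
    Q, _ = np.linalg.qr((G @ A).T)            # Q: n x ell, orthonormal columns
    # Top-k right singular directions of A restricted to span(Q).
    _, _, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    return Q @ Wt[:k].T                        # Z: n x k

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))  # exactly rank 5
Z = approx_right_singular_vectors(A, k=5)
E = A - A @ Z @ Z.T                            # residual E = A - A Z Z^T
```

Note that $EZ = AZ - AZZ^TZ = 0$ holds for any $Z$ with orthonormal columns, so that property comes for free; the substance of Lemma 2 is the error bound, which this sketch does not certify.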

Pseudo-inverse. $A^+ = V_A\Sigma_A^{-1}U_A^T \in \mathbb{R}^{n \times m}$ denotes the so-called Moore-Penrose pseudo-inverse of $A$, where $A = U_A\Sigma_AV_A^T$ is the SVD of $A$ (here $\Sigma_A^{-1}$ is the inverse of $\Sigma_A$), i.e. the unique $n \times m$ matrix satisfying all four properties: $A = AA^+A$, $A^+AA^+ = A^+$, $(AA^+)^T = AA^+$, and $(A^+A)^T = A^+A$. By the SVD of $A$ and $A^+$, it is easy to verify that, for all $i = 1, \ldots, \rho = \mathrm{rank}(A) = \mathrm{rank}(A^+)$: $\sigma_i(A^+) = 1/\sigma_{\rho-i+1}(A)$. Finally, for any $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times \ell}$: $(AB)^+ = B^+A^+$ if any one of the following three properties holds: (i) $A^TA = I_n$; (ii) $BB^T = I_n$; or (iii) $\mathrm{rank}(A) = \mathrm{rank}(B) = n$.

Projection Matrices. $P \in \mathbb{R}^{n \times n}$ is a projection matrix if $P^2 = P$. For such a projection matrix and any $A$: $\|PA\|_F \le \|A\|_F$. Also, if $P$ is a projection matrix, then $I_n - P$ is a projection matrix. So, for any matrix $A \in \mathbb{R}^{m \times n}$, both $AA^+$ and $I_m - AA^+$ are projection matrices.
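The pseudo-inverse and projection facts above can be checked numerically. The sketch below is a minimal example on a random full-column-rank matrix, using `numpy.linalg.pinv`; the matrix sizes are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((7, 4))            # full column rank with probability 1
A_pinv = np.linalg.pinv(A)                 # Moore-Penrose pseudo-inverse, 4 x 7

# Singular values are returned in descending order.
sv_A = np.linalg.svd(A, compute_uv=False)
sv_pinv = np.linalg.svd(A_pinv, compute_uv=False)

# A A^+ is the orthogonal projector onto the column space of A.
P = A @ A_pinv
proj_norm = np.linalg.norm(P @ A, "fro")   # equals ||A||_F because P A = A
```

Here `sv_pinv` coincides with the reciprocals of `sv_A` read in reverse order, which is the relation $\sigma_i(A^+) = 1/\sigma_{\rho-i+1}(A)$.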

Markov's Inequality and the Union Bound. Markov's inequality can be stated as follows: Let $Y$ be a random variable taking non-negative values with expectation $\mathbb{E}\,Y$. Then, for all $t > 0$, and with probability at least $1 - t^{-1}$, $Y \le t \cdot \mathbb{E}\,Y$. We will also use the so-called union bound. Given a set of probabilistic events $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n$ holding with respective probabilities $p_1, p_2, \ldots, p_n$, the probability that at least one of the events holds (i.e., the probability of the union of those events) is upper bounded as: $\mathbb{P}(\mathcal{E}_1 \cup \mathcal{E}_2 \cup \cdots \cup \mathcal{E}_n) \le \sum_{i=1}^{n} p_i$.

2.1 Linear Algebraic Formulation of k-means

From now on, we will switch to a more convenient linear algebraic formulation of the k-means clustering problem, following the notation used in the introduction. Define the data matrix $A \in \mathbb{R}^{m \times n}$, which has the data points as its rows,

$$A^T = [p_1, \ldots, p_m] \in \mathbb{R}^{n \times m}.$$

We represent a $k$-clustering $\mathcal{S}$ by its cluster indicator matrix $X \in \mathbb{R}^{m \times k}$. Each column $j = 1, \ldots, k$ of $X$ represents a cluster. Each row $i = 1, \ldots, m$ indicates the cluster membership of the point $p_i$. So, $X_{ij} = 1/\sqrt{s_j}$ if and only if data point $p_i$ is in cluster $S_j$. Every row of $X$ has exactly one non-zero element, corresponding to the cluster the data point belongs to. There are $s_j$ non-zero elements in column $j$, which indicate the data points belonging to cluster $S_j$. The two formulations are equivalent,

$$\mathcal{F}(A, X) = \|A - XX^TA\|_F^2 = \sum_{i=1}^{m} \|p_i^T - X_{(i)}X^TA\|_2^2 = \sum_{i=1}^{m} \|p_i^T - \mu(p_i)^T\|_2^2 = \mathcal{F}(\mathcal{P}, \mathcal{S}).$$

After some elementary algebra, one can verify that for $i = 1, \ldots, m$, $X_{(i)}X^TA = \mu(p_i)^T$. Using this formulation, the goal of k-means is to find $X$ which minimizes $\|A - XX^TA\|_F^2$.
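The equivalence between the matrix form and the point-by-point form of the objective can be verified on a small example. The following NumPy sketch builds the normalized indicator matrix for a fixed labeling and evaluates both forms; the helper name `indicator_matrix` and the toy labeling are assumptions of this example.

```python
import numpy as np

def indicator_matrix(labels, k):
    """Build the m x k matrix X with X[i, j] = 1/sqrt(s_j) if point i is in cluster j."""
    m = len(labels)
    sizes = np.bincount(labels, minlength=k)          # cluster sizes s_j
    X = np.zeros((m, k))
    X[np.arange(m), labels] = 1.0 / np.sqrt(sizes[labels])
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))                       # m = 8 points, n = 3 features
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0])           # k = 3 non-empty clusters
X = indicator_matrix(labels, k=3)

# Matrix form of the objective: ||A - X X^T A||_F^2.
matrix_form = np.linalg.norm(A - X @ X.T @ A, "fro") ** 2

# Point-by-point form: squared distance of every point to its cluster centroid.
centroids = np.vstack([A[labels == j].mean(axis=0) for j in range(3)])
pointwise_form = np.sum((A - centroids[labels]) ** 2)
```

Row $i$ of `X @ X.T @ A` is the centroid of the cluster containing point $i$, which is why the two values coincide; the normalization $1/\sqrt{s_j}$ also makes the columns of `X` orthonormal.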

To evaluate the quality of different clusterings, we will use the k-means objective function. Given some clustering $X$, we are interested in the ratio $\mathcal{F}(A, X)/\mathcal{F}(A, X_{opt})$, where $X_{opt}$ is the optimal clustering of $A$. The choice of evaluating a clustering this way is not new. In fact, [21, 17, 13, 12, 9, 22] provide results (other than dimensionality reduction methods) along the same lines. Below, we give the formal definitions of the k-means problem and a k-means approximation algorithm.

Definition 3. [The k-means clustering problem] Given $A \in \mathbb{R}^{m \times n}$ (representing $m$ data points - rows - described with respect to $n$ features - columns) and a positive integer $k$ denoting the number of clusters, find the indicator matrix $X_{opt} \in \mathbb{R}^{m \times k}$ which satisfies,

$$X_{opt} = \arg\min_{X \in \mathcal{X}} \|A - XX^TA\|_F^2.$$

The optimal value of the k-means clustering objective is

$$\mathcal{F}(A, X_{opt}) = \min_{X \in \mathcal{X}} \|A - XX^TA\|_F^2 = \|A - X_{opt}X_{opt}^TA\|_F^2 = F_{opt}.$$

In the above, $\mathcal{X}$ denotes the set of all $m \times k$ indicator matrices $X$.

Definition 4. [k-means approximation algorithm] An algorithm is a "$\gamma$-approximation" for the k-means clustering problem ($\gamma \ge 1$) if it takes as inputs the dataset $A \in \mathbb{R}^{m \times n}$ and the number of clusters $k$, and returns an indicator matrix $X_\gamma \in \mathbb{R}^{m \times k}$ such that w.p. at least $1 - \delta_\gamma$,

$$\|A - X_\gamma X_\gamma^TA\|_F^2 \le \gamma \min_{X \in \mathcal{X}} \|A - XX^TA\|_F^2 = \gamma\,\mathcal{F}(A, X_{opt}) = \gamma F_{opt}.$$

An example of such an algorithm is [17] with $\gamma = 1 + \epsilon$ ($0 < \epsilon < 1$), and $\delta_\gamma$ some constant in $(0, 1)$. The corresponding running time is $O(mn \cdot 2^{(k/\epsilon)^{O(1)}})$.

2.2 Randomized Sampling

Sampling and Rescaling Matrices. Let $A = [A^{(1)}, \ldots, A^{(n)}] \in \mathbb{R}^{m \times n}$ and let $C = [A^{(i_1)}, \ldots, A^{(i_r)}] \in \mathbb{R}^{m \times r}$ consist of $r < n$ columns of $A$. Note that $C = A\Omega$, where the sampling matrix $\Omega = [e_{i_1}, \ldots, e_{i_r}] \in \mathbb{R}^{n \times r}$ (here $e_i$ are the standard basis vectors in $\mathbb{R}^n$). If $S \in \mathbb{R}^{r \times r}$ is a diagonal rescaling matrix then $A\Omega S$ contains $r$ rescaled columns of $A$.

The following definition describes a simple randomized sampling procedure with replacement, which will be critical in our feature selection algorithm.

Definition 5 (Random Sampling with Replacement). Let $X \in \mathbb{R}^{n \times k}$ with $n > k$ and let $X_{(i)}$ denote the $i$-th row of $X$ as a row vector. For all $i = 1, \ldots, n$, define the following set of sampling probabilities:

$$p_i = \frac{\|X_{(i)}\|_2^2}{\|X\|_F^2},$$

and note that $\sum_{i=1}^{n} p_i = 1$. Let $r$ be a positive integer and construct the sampling matrix $\Omega \in \mathbb{R}^{n \times r}$ and the rescaling matrix $S \in \mathbb{R}^{r \times r}$ as follows: initially, $\Omega = 0_{n \times r}$ and $S = 0_{r \times r}$; for $t = 1, \ldots, r$ pick an integer $i_t$ from the set $\{1, 2, \ldots, n\}$ where the probability of picking $i$ is equal to $p_i$; set $\Omega_{i_t t} = 1$ and $S_{tt} = 1/\sqrt{r p_{i_t}}$. We denote this randomized sampling technique with replacement by

$$[\Omega, S] = \text{RandomizedSampling}(X, r).$$

Note that this procedure can be implemented in $O(nk + r\log(r))$ time.
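The sampling procedure of Definition 5 can be sketched in a few lines of NumPy. The function name `randomized_sampling`, the `seed` argument, and the extra return value `idx` (the sampled indices $i_1, \ldots, i_r$) are assumptions of this example; the dense construction of $\Omega$ is used for clarity and does not aim at the $O(nk + r\log(r))$ running time.

```python
import numpy as np

def randomized_sampling(X, r, seed=0):
    """Sample r row indices of X with replacement, following Definition 5.

    Returns (Omega, S, idx): the n x r sampling matrix, the r x r diagonal
    rescaling matrix, and the sampled indices.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    p = np.sum(X ** 2, axis=1) / np.sum(X ** 2)    # p_i = ||X_(i)||_2^2 / ||X||_F^2
    idx = rng.choice(n, size=r, replace=True, p=p)
    Omega = np.zeros((n, r))
    Omega[idx, np.arange(r)] = 1.0                 # column t selects index i_t
    S = np.diag(1.0 / np.sqrt(r * p[idx]))         # S_tt = 1 / sqrt(r p_{i_t})
    return Omega, S, idx

rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((50, 3)))  # orthonormal columns: n = 50, k = 3
Omega, S, idx = randomized_sampling(V, r=20)
A = rng.standard_normal((10, 50))
C = A @ Omega @ S                                   # 20 rescaled columns of A
```

In practice one would avoid forming $\Omega$ explicitly and index the columns directly, since `A @ Omega @ S` equals `A[:, idx] / np.sqrt(r * p[idx])`.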

The next three lemmas present the effect of the above sampling procedure on orthogonal matri- 
ces. The first two lemmas are known; short proofs are included for the sake of completeness. The 
third lemma follows easily from the first two results. 

Lemma 6. Let $V \in \mathbb{R}^{n \times k}$ with $n > k$ and $V^TV = I_k$. Let $0 < \delta < 1$, $4k\ln(2k/\delta) < r \le n$, and $[\Omega, S]$ = RandomizedSampling($V$, $r$). Then, for all $i = 1, \ldots, k$, w.p. at least $1 - \delta$,

$$1 - \sqrt{\frac{4k\ln(2k/\delta)}{r}} \le \sigma_i^2(V^T\Omega S) \le 1 + \sqrt{\frac{4k\ln(2k/\delta)}{r}}.$$

Proof. This result was originally proven in [23]. We will leverage a more recent proof of this result that appeared in [20] and improves the original constants. More specifically, in Theorem 2 of [20], set $S = I$, $\beta = 1$, and replace $\epsilon$ as a function of $r$, $\beta$, and $d$ to conclude the proof. ■

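Lemma 7. For any $r \ge 1$, $X \in \mathbb{R}^{n \times k}$, $Y \in \mathbb{R}^{m \times n}$, and $0 < \delta < 1$, let $[\Omega, S]$ = RandomizedSampling($X$, $r$). Then, w.p. at least $1 - \delta$,

$$\|Y\Omega S\|_F^2 \le \frac{1}{\delta}\|Y\|_F^2.$$

Proof. See Appendix. ■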

Lemma 8. Fix $A \in \mathbb{R}^{m \times n}$, $k \ge 1$, $0 < \epsilon < 1/3$, $0 < \delta < 1$, and $r = 4k\ln(2k/\delta)/\epsilon^2$. Compute the $n \times k$ matrix $Z$ of Lemma 2 such that $A = AZZ^T + E$ and run $[\Omega, S]$ = RandomizedSampling($Z$, $r$). Then, w.p. at least $1 - 3\delta$, there exists $\tilde{E} \in \mathbb{R}^{m \times n}$ such that

$$AZZ^T = A\Omega S(Z^T\Omega S)^+Z^T + \tilde{E},$$

and $\|\tilde{E}\|_F \le \frac{1.6\epsilon}{\sqrt{\delta}}\|E\|_F$.

Proof. See Appendix. ■

2.3 Random Projections

A classic result of [16] states that, for any $0 < \epsilon < 1$, any set of $m$ points in $n$ dimensions (rows in $A \in \mathbb{R}^{m \times n}$) can be linearly projected into $r_\epsilon = O(\log(m)/\epsilon^2)$ dimensions while preserving all the pairwise Euclidean distances of the points within a multiplicative factor of $(1 \pm \epsilon)$. More precisely, [16] showed the existence of a (random orthonormal) matrix $R \in \mathbb{R}^{n \times r_\epsilon}$ such that, for all $i, j = 1, \ldots, m$, and with high probability (over the randomness of the matrix $R$),

$$(1 - \epsilon)\|A_{(i)} - A_{(j)}\|_2 \le \|(A_{(i)} - A_{(j)})R\|_2 \le (1 + \epsilon)\|A_{(i)} - A_{(j)}\|_2.$$

Subsequent research simplified the proof of [16] by showing that such a linear transformation can be generated using a random Gaussian matrix, i.e., a matrix $R \in \mathbb{R}^{n \times r_\epsilon}$ whose entries are i.i.d. Gaussian random variables with zero mean and variance $1/r$ [15]. Recently, [2] presented the so-called Fast Johnson-Lindenstrauss Transform, which describes an $R \in \mathbb{R}^{n \times r_\epsilon}$ such that the product $AR$ can be computed fast. In this paper, we will use a construction by [1], who proved that a rescaled random sign matrix, i.e. a matrix $R \in \mathbb{R}^{n \times r_\epsilon}$ whose entries have values $\{\pm 1/\sqrt{r}\}$ uniformly at random, satisfies the above equation. As we will see in detail in Section 4, a recent result of [18] indicates that, if $R$ is constructed as in [1], the product $AR$ can be computed fast as well. We utilize such a random projection embedding in Section 4. Here, we summarize some properties of such matrices that might be of independent interest. We have deferred the proofs of the following lemmata to the Appendix.

Lemma 9. Fix any $m \times n$ matrix $Y$ and $\epsilon > 0$. Let $R \in \mathbb{R}^{n \times r}$ be a rescaled random sign matrix constructed as described above with $r = c_0k/\epsilon^2$, where $c_0 \ge 100$. Then,

$$\mathbb{P}\left(\|YR\|_F^2 \ge (1 + \epsilon)\|Y\|_F^2\right) \le 0.01.$$

Lemma 10. Let $A \in \mathbb{R}^{m \times n}$ with rank $\rho$ ($k < \rho$), $A_k = U_k\Sigma_kV_k^T$, and $0 < \epsilon < 1/3$. Let $R \in \mathbb{R}^{n \times r}$ be a (rescaled) random sign matrix constructed as we described above with $r = c_0k/\epsilon^2$, where $c_0 \ge 3330$. The following hold (simultaneously) w.p. at least 0.97:

1. For all $i = 1, \ldots, k$: $1 - \epsilon \le \sigma_i^2(V_k^TR) \le 1 + \epsilon$.

2. There exists an $m \times n$ matrix $E$ such that $A_k = AR(V_k^TR)^+V_k^T + E$ and $\|E\|_F \le 3\epsilon\|A - A_k\|_F$.

Lemma 9 is the analog of Lemma 7. The first statement of Lemma 10 is the analog of Lemma 6, while the second statement of Lemma 10 is the analog of Lemma 8. The results here replace the sampling and rescaling matrices $\Omega, S$ from Random Sampling (Definition 5) with the Random Projection matrix $R$. It is worth noting that almost the same results can be achieved with $r = O(k/\epsilon^2)$ random dimensions, while the corresponding lemmata for Random Sampling require at least $r = O(k\log k/\epsilon^2)$ actual dimensions.

3 Feature Selection with Randomized Sampling 

Given $A$, $k$, and $0 < \epsilon < 1/3$, Algorithm 1 is our main algorithm for feature selection in k-means clustering. In a nutshell, construct the matrix $Z$ with the (approximate) top-$k$ right singular vectors of $A$ and select $r = O(k\log(k)/\epsilon^2)$ columns from $Z^T$ (equivalently, $r$ features of $A$) with the randomized technique of Section 2.2. One can replace the first step in Algorithm 1 with the exact SVD of $A$ [5]. The result that is obtained from this approach is asymptotically the same as the one we will present in Theorem 11 (see footnote 2). Working with $Z$ though gives a considerably faster algorithm.

Input: Dataset $A \in \mathbb{R}^{m \times n}$, number of clusters $k$, and $0 < \epsilon < 1/3$.
Output: $C \in \mathbb{R}^{m \times r}$ with $r = O(k\log(k)/\epsilon^2)$ rescaled features.

1: Let $Z$ = FastFrobeniusSVD($A$, $k$, $\epsilon$); $Z \in \mathbb{R}^{n \times k}$ (via Lemma 2).
2: Let $r = c_1 \cdot 4k\ln(200k)/\epsilon^2$ ($c_1$ is a sufficiently large constant - see proof).
3: Let $[\Omega, S]$ = RandomizedSampling($Z$, $r$); $\Omega \in \mathbb{R}^{n \times r}$, $S \in \mathbb{R}^{r \times r}$ (see Definition 5).
4: Return $C = A\Omega S \in \mathbb{R}^{m \times r}$ with $r$ rescaled columns from $A$.

Algorithm 1: Stochastic Feature Selection for k-means Clustering.
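The steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under two stated assumptions: the exact SVD is used in place of FastFrobeniusSVD (a substitution the text above explicitly allows), and the number of sampled features `r` is passed in directly rather than set to $c_1 \cdot 4k\ln(200k)/\epsilon^2$. The function name `feature_selection` is hypothetical.

```python
import numpy as np

def feature_selection(A, k, r, seed=0):
    """Sketch of Algorithm 1 with the exact SVD substituted for FastFrobeniusSVD.

    r is supplied by the caller (an assumption of this example).
    Returns the m x r matrix C and the indices of the sampled columns of A.
    """
    rng = np.random.default_rng(seed)
    # Step 1: Z holds the top-k right singular vectors of A (n x k).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Z = Vt[:k].T
    # Step 3: sampling probabilities from the squared row norms of Z.
    p = np.sum(Z ** 2, axis=1) / np.sum(Z ** 2)
    idx = rng.choice(A.shape[1], size=r, replace=True, p=p)
    # Step 4: C = A Omega S, i.e. sampled columns rescaled by 1 / sqrt(r p_i).
    C = A[:, idx] / np.sqrt(r * p[idx])
    return C, idx

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 30))
C, idx = feature_selection(A, k=4, r=12)
```

Any $\gamma$-approximation k-means algorithm would then be run on the rows of `C`, and the resulting partition used to cluster the rows of `A`.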

Theorem 11. Let $A \in \mathbb{R}^{m \times n}$ and $k$ be inputs of the k-means clustering problem. Let $\epsilon \in (0, 1/3)$ and, by using Algorithm 1 in $O(mnk/\epsilon + k\ln(k)/\epsilon^2\log(k\ln(k)/\epsilon))$ time, construct features $C \in \mathbb{R}^{m \times r}$ with $r = O(k\log(k)/\epsilon^2)$. Run any $\gamma$-approximation k-means algorithm on $C$, $k$ and construct $X_\gamma$. Then, w.p. at least $0.2 - \delta_\gamma$,

$$\|A - X_\gamma X_\gamma^TA\|_F^2 \le (1 + (2 + \epsilon)\gamma)\|A - X_{opt}X_{opt}^TA\|_F^2.$$



Proof. We start by manipulating the term $\|A - X_\gamma X_\gamma^TA\|_F^2$. Notice that $A = AZZ^T + E$ (from Lemma 2). Also,

$$\left((I_m - X_\gamma X_\gamma^T)AZZ^T\right)\left((I_m - X_\gamma X_\gamma^T)E\right)^T = 0_{m \times m},$$

because $Z^TE^T = 0_{k \times m}$, by construction. Now, using Matrix Pythagoras (see Lemma 1),

$$\|A - X_\gamma X_\gamma^TA\|_F^2 = \underbrace{\|(I_m - X_\gamma X_\gamma^T)AZZ^T\|_F^2}_{\theta_1^2} + \underbrace{\|(I_m - X_\gamma X_\gamma^T)E\|_F^2}_{\theta_2^2}. \qquad (1)$$

We first bound the second term of Eqn. (1). Since $I_m - X_\gamma X_\gamma^T$ is a projection matrix, it can be dropped without increasing the Frobenius norm (see Section 2). Applying Markov's inequality (see footnote 3) on the equation of Lemma 2, we obtain that w.p. 0.99,

$$\|E\|_F^2 \le (1 + 100\epsilon)\|A - A_k\|_F^2.$$

Note also that $X_{opt}X_{opt}^TA$ has rank at most $k$; so, from the optimality of the SVD, overall,

$$\theta_2^2 \le (1 + 100\epsilon)\|A - A_k\|_F^2 \le (1 + 100\epsilon)\|A - X_{opt}X_{opt}^TA\|_F^2 = (1 + 100\epsilon)F_{opt}.$$



^2 The main theorem of [5] states a $(1 + (1 + \epsilon)\gamma)$-approximation bound but the corresponding proof has a bug, which is fixable and leads to a $(1 + (2 + \epsilon)\gamma)$-approximation bound. One can replicate the corresponding (fixable) proof in [5] by replacing $Z = V_k$ in the proof of Theorem 11 of our work.

^3 $\mathbb{E}\,\|E\|_F^2 \le (1 + \epsilon)\|A - A_k\|_F^2 \Leftrightarrow \mathbb{E}\,\|E\|_F^2 - \|A - A_k\|_F^2 \le \epsilon\|A - A_k\|_F^2$. Now, apply Markov's inequality on the random variable $Y = \|E\|_F^2 - \|A - A_k\|_F^2 \ge 0$. ($Y \ge 0$ because $E = A - AZZ^T$ and $\mathrm{rank}(AZZ^T) \le k$.) This gives $\|E\|_F^2 - \|A - A_k\|_F^2 \le 100\epsilon\|A - A_k\|_F^2$ w.p. 0.99; so, $\|E\|_F^2 \le \|A - A_k\|_F^2 + 100\epsilon\|A - A_k\|_F^2$.

We now bound the first term in Eqn. (1),

$$\theta_1 \le \|(I_m - X_\gamma X_\gamma^T)A\Omega S(Z^T\Omega S)^+Z^T\|_F + \|\tilde{E}\|_F \qquad (2)$$
$$\le \|(I_m - X_\gamma X_\gamma^T)A\Omega S\|_F\,\|(Z^T\Omega S)^+\|_2 + \|\tilde{E}\|_F \qquad (3)$$
$$\le \sqrt{\gamma}\,\|(I_m - X_{opt}X_{opt}^T)A\Omega S\|_F\,\|(Z^T\Omega S)^+\|_2 + \|\tilde{E}\|_F \qquad (4)$$

In Eqn. (2), we used Lemma 8 (for an unspecified failure probability $\delta$), the triangle inequality, and the fact that $I_m - X_\gamma X_\gamma^T$ is a projection matrix and can be dropped without increasing the Frobenius norm. In Eqn. (3), we used spectral submultiplicativity and the fact that $Z^T$ can be dropped without changing the spectral norm. In Eqn. (4), we replaced $X_\gamma$ by $X_{opt}$ and the factor $\sqrt{\gamma}$ appeared in the first term. To better understand this step, notice that $X_\gamma$ gives a $\gamma$-approximation to the optimal k-means clustering of $C = A\Omega S$, so any other $m \times k$ indicator matrix (e.g. $X_{opt}$) satisfies,

$$\|(I_m - X_\gamma X_\gamma^T)A\Omega S\|_F^2 \le \gamma \min_{X \in \mathcal{X}} \|(I_m - XX^T)A\Omega S\|_F^2 \le \gamma\|(I_m - X_{opt}X_{opt}^T)A\Omega S\|_F^2.$$

By using Lemma 7 with $\delta = 3/4$ and Lemma 6 (for an unspecified failure probability $\delta$),

$$\|(I_m - X_{opt}X_{opt}^T)A\Omega S\|_F\,\|(Z^T\Omega S)^+\|_2 \le \sqrt{\frac{4}{3 - 3\epsilon}}\,\sqrt{F_{opt}}.$$

We are now in position to bound $\theta_1$. In Lemmas 8 and 6 let $\delta = 0.01$. Assuming $1 \le \gamma$,

$$\theta_1 \le \left(\sqrt{\gamma}\sqrt{\frac{4}{3 - 3\epsilon}} + \frac{1.6\epsilon\sqrt{1 + 100\epsilon}}{\sqrt{0.01}}\right)\sqrt{F_{opt}} \le \left(\sqrt{2} + 94\epsilon\right)\sqrt{\gamma}\sqrt{F_{opt}}.$$

The last inequality follows from our choice of $\epsilon < 1/3$ and elementary algebra. Taking squares on both sides,

$$\theta_1^2 \le (\sqrt{2} + 94\epsilon)^2\gamma F_{opt} \le (2 + 3900\epsilon)\gamma F_{opt}.$$

Overall (assuming $1 \le \gamma$),

$$\|A - X_\gamma X_\gamma^TA\|_F^2 \le \theta_1^2 + \theta_2^2 \le (2 + 3900\epsilon)\gamma F_{opt} + (1 + 100\epsilon)F_{opt} \le F_{opt} + (2 + 4 \cdot 10^3\epsilon)\gamma F_{opt}.$$

Rescaling $\epsilon$ accordingly ($c_1 = 16 \cdot 10^6$) gives the bound in the Theorem. The failure probability follows by a union bound on Lemma 7 (with $\delta = 3/4$), Lemma 8 (with $\delta = 0.01$), Lemma 6 (with $\delta = 0.01$), Lemma 2 (followed by Markov's inequality with $\delta = 0.01$), and Definition 4 (with failure probability $\delta_\gamma$). Indeed, $0.75 + 3 \cdot 0.01 + 0.01 + 0.01 + \delta_\gamma = 0.8 + \delta_\gamma$ is the overall failure probability, hence the bound in the theorem holds w.p. $0.2 - \delta_\gamma$. ■
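For readers who want the "elementary algebra" behind the squaring step of the proof spelled out, the following worked computation expands the square and uses only $\epsilon < 1/3$. It is a verification sketch of the stated constants, not part of the original argument.

```latex
(\sqrt{2} + 94\epsilon)^2
  = 2 + 188\sqrt{2}\,\epsilon + 8836\,\epsilon^2
  \le 2 + 266\,\epsilon + \frac{8836}{3}\,\epsilon
  \le 2 + 3212\,\epsilon
  \le 2 + 3900\,\epsilon .
```

The second step uses $\epsilon^2 \le \epsilon/3$ for $\epsilon < 1/3$, together with $188\sqrt{2} < 266$ and $8836/3 < 2946$.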

4 Feature Extraction with Random Projections 

We prove that any set of $m$ points in $n$ dimensions (rows in a matrix $A \in \mathbb{R}^{m \times n}$) can be projected into $r = O(k/\epsilon^2)$ dimensions in $O(mn\lceil \epsilon^{-2}k/\log(n) \rceil)$ time such that, with constant probability, the objective value of the optimal $k$-partition of the points is preserved within a factor of $2 + \epsilon$. The projection is done by post-multiplying $A$ with an $n \times r$ random matrix $R$ having entries $+1/\sqrt{r}$ or $-1/\sqrt{r}$ with equal probability.

Input: Dataset $A \in \mathbb{R}^{m \times n}$, number of clusters $k$, and $0 < \epsilon < 1/3$.
Output: $C \in \mathbb{R}^{m \times r}$ with $r = O(k/\epsilon^2)$ artificial features.

1: Set $r = c_2 \cdot k/\epsilon^2$, for a sufficiently large constant $c_2$ (see proof).
2: Compute a random $n \times r$ matrix $R$ as follows. For all $i = 1, \ldots, n$, $j = 1, \ldots, r$ (i.i.d.)

$$R_{ij} = \begin{cases} +1/\sqrt{r}, & \text{w.p. } 1/2, \\ -1/\sqrt{r}, & \text{w.p. } 1/2. \end{cases}$$

3: Compute $C = AR$ with the Mailman Algorithm (see text).
4: Return $C \in \mathbb{R}^{m \times r}$.

Algorithm 2: Stochastic Feature Extraction for k-means Clustering.



The algorithm needs $O(nk/\epsilon^2)$ time to generate the $n \times r$ matrix $R$; then, the product $AR$ can be naively computed in $O(mnk/\epsilon^2)$ time. However, one can employ the so-called mailman algorithm for matrix multiplication [18] and compute the product $AR$ in $O(mn\lceil \epsilon^{-2}k/\log(n) \rceil)$ time. Indeed, the mailman algorithm computes (after preprocessing) a matrix-vector product of any $n$-dimensional vector (row of $A$) with an $n \times \log(n)$ sign matrix in $O(n)$ time. Reading the input $n \times \log n$ sign matrix requires $O(n\log n)$ time. However, in our case we only consider multiplication with a random sign matrix, therefore we can avoid the preprocessing step by directly computing a random correspondence matrix as discussed in [18, Preprocessing Section]. By partitioning the columns of our $n \times r$ matrix $R$ into $\lceil r/\log(n) \rceil$ blocks, the claim follows.
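A minimal NumPy sketch of the random sign projection is given below. It deliberately uses a plain matrix product instead of the mailman algorithm, so it runs in the naive time rather than the improved one, and it takes `r` as an argument instead of setting $r = c_2 \cdot k/\epsilon^2$; both are assumptions of this example, as is the function name `random_sign_projection`.

```python
import numpy as np

def random_sign_projection(A, r, seed=0):
    """Sketch of Algorithm 2: C = A R with R an n x r rescaled random sign matrix.

    Uses a naive matrix product (not the mailman algorithm) and a
    caller-supplied r; both are simplifications made for this example.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Entries are +1/sqrt(r) or -1/sqrt(r), each with probability 1/2.
    R = rng.choice([-1.0, 1.0], size=(n, r)) / np.sqrt(r)
    return A @ R, R

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 500))
C, R = random_sign_projection(A, r=200)

# Ratio of squared Frobenius norms; its expectation over R is exactly 1.
ratio = np.linalg.norm(C, "fro") ** 2 / np.linalg.norm(A, "fro") ** 2
```

Since $\mathbb{E}[RR^T] = I_n$, the squared Frobenius norm of any fixed matrix is preserved in expectation, which is the fact that Lemma 9 turns into a tail bound.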

Theorem 12 is our quality-of-approximation result regarding the clustering that can be obtained with the features returned from Algorithm 2. Notice that if $\gamma = 1$, the distortion is at most $2 + \epsilon$, as advertised in Table 1. If the $\gamma$-approximation algorithm is [17], the overall approximation factor would be $(1 + (1 + \epsilon)^2)$ with running time of the order $O(mn\lceil \epsilon^{-2}k/\log(n) \rceil + 2^{(k/\epsilon)^{O(1)}}mk/\epsilon^2)$.

Theorem 12. Let $A \in \mathbb{R}^{m \times n}$ and $k$ be the inputs of the k-means clustering problem. Let $\epsilon \in (0, 1/3)$ and construct features $C \in \mathbb{R}^{m \times r}$ with $r = O(k/\epsilon^2)$ by using Algorithm 2 in $O(mn\lceil \epsilon^{-2}k/\log(n) \rceil)$ time. Run any $\gamma$-approximation k-means algorithm on $C$, $k$ and construct $X_\gamma$. Then, w.p. at least $0.96 - \delta_\gamma$,

$$\|A - X_\gamma X_\gamma^TA\|_F^2 \le (1 + (1 + \epsilon)\gamma)\|A - X_{opt}X_{opt}^TA\|_F^2.$$
Proof. We start by manipulating the term ||A — X;yXTA||p. Notice that A = A^ -|- Ap_fc. Also, 

(^(im - X^XT^ Afc^ (^(im - X^XT^ Ap„fc^ = Omxm, because AfcAj_^ = Omxm, by the orthog- 
onality of the corresponding subspaces. Now, using Lemma [H 

||A-X^xTA||| = ||(I^-X^xT)Afc||| + ||(I^ - X;^X?)Ap_fc||| . (5) 

Bl el 

We first bound the second term of Eqn. ([5]). S iiic6 I772 — X^X- is a projection matrix, it can be 
dropped without increasing the Frobenius norm. So, by using this and the fact that XoptXjp^-A 



11 



has rank at most k, 

ej < \\Ap_k\\l =I|A-Afc||2 < ||A - XoptXjptA|||. (6) 
We now bound the first term of Eqn. ([5]), 

03 < ||(I™,-X;^xT)AR(VfcR)+V^||F + ||E||F (7) 

< ||(I^-X;^xT)AR||F||(VfcR)+||2 + ||E||F (8) 

< V7ll(Im-XoptXTjAR||F||(VfeR)+||2 + ||E||F (9) 

< v^^/TT^IKl^-XoptX^ JA||FY^ + 3e||A- AfellF (10) 

< V7(l + 2.5e)||(I^-XoptXjpJA||F + 3eV7ll(Im-XoptXTjA||F (11) 
= V7(l + 5.5e)||(I„-XoptXjpJA||F (12) 



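In Eqn. (7), we used the second statement of Lemma 10, the triangle inequality for matrix norms,
and the fact that $\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T$ is a projection matrix and can be dropped without increasing the
Frobenius norm. In Eqn. (8), we used spectral submultiplicativity and the fact that $\mathbf{V}_k^T$ can be
dropped without changing the spectral norm. In Eqn. (9), we replaced $\mathbf{X}_{\tilde\gamma}$ by $\mathbf{X}_{\mathrm{opt}}$ and the factor
$\sqrt{\gamma}$ appeared in the first term. To better understand this step, notice that $\mathbf{X}_{\tilde\gamma}$ gives a $\gamma$-approximation
to the optimal k-means clustering of the matrix $\mathbf{C}$, and any other $m \times k$ indicator matrix (for
example, the matrix $\mathbf{X}_{\mathrm{opt}}$) satisfies,

$$\|(\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{C}\|_F^2 \le \gamma \min_{\mathbf{X} \in \mathcal{X}} \|(\mathbf{I}_m - \mathbf{X}\mathbf{X}^T)\mathbf{C}\|_F^2 \le \gamma\, \|(\mathbf{I}_m - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{C}\|_F^2.$$

In Eqn. (10), we used the first statement of Lemma 10 and Lemma 9 with $\mathbf{Y} = (\mathbf{I} - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{A}$.
In Eqn. (11), we used the fact that $\gamma \ge 1$, the optimality of SVD, and that for any $\epsilon \in (0, 1/3)$,
$\sqrt{1+\epsilon}/(1 - \epsilon) \le 1 + 2.5\epsilon$. Taking squares in Eqn. (12) we obtain,

$$\theta_3^2 \le \gamma\,(1 + 5.5\epsilon)^2\, \|(\mathbf{I}_m - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{A}\|_F^2 \le \gamma\,(1 + 15\epsilon)\, \|(\mathbf{I}_m - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{A}\|_F^2.$$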

Rescaling $\epsilon$ accordingly gives the approximation bound in the theorem ($c_0 = 3330 \cdot 15^2$). The failure
probability $0.04 + \delta_\gamma$ follows by a union bound on the failure probability $\delta_\gamma$ of the $\gamma$-approximation
k-means algorithm (see the definition of a $\gamma$-approximation algorithm), Lemma 9, and Lemma 10. ■
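To make the objective that appears on both sides of Theorem 12 concrete, the following is a minimal Python sketch (it is not part of the analysis above). It assumes that $\mathbf{X}$ is the normalized cluster indicator matrix, with entry $1/\sqrt{s_j}$ when point $i$ belongs to cluster $j$ of size $s_j$, and that every cluster is non-empty. The function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def indicator_matrix(labels, k):
    """Normalized m x k indicator: X[i, j] = 1/sqrt(s_j) if point i is in cluster j.

    Assumes labels take values in {0, ..., k-1} and every cluster is non-empty.
    """
    labels = np.asarray(labels)
    m = labels.shape[0]
    sizes = np.bincount(labels, minlength=k)
    X = np.zeros((m, k))
    X[np.arange(m), labels] = 1.0 / np.sqrt(sizes[labels])
    return X

def kmeans_objective(A, X):
    """Evaluate ||A - X X^T A||_F^2 for a normalized indicator matrix X."""
    return np.linalg.norm(A - X @ (X.T @ A), "fro") ** 2

def centroid_cost(A, labels, k):
    """Sum of squared distances of the rows of A to their cluster centroids."""
    labels = np.asarray(labels)
    cost = 0.0
    for j in range(k):
        pts = A[labels == j]
        cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost
```

Under these assumptions $\mathbf{X}\mathbf{X}^T\mathbf{A}$ replaces each row of $\mathbf{A}$ by its cluster centroid, so the two cost functions agree for any partition.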



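Input: Dataset $\mathbf{A} \in \mathbb{R}^{m \times n}$, number of clusters $k$, and $0 < \epsilon < 1$.
Output: $\mathbf{C} \in \mathbb{R}^{m \times k}$ with $k$ artificial features.

1: Let $\mathbf{Z}$ = FastFrobeniusSVD($\mathbf{A}, k, \epsilon$); $\mathbf{Z} \in \mathbb{R}^{n \times k}$ (via Lemma 2).

2: Return $\mathbf{C} = \mathbf{A}\mathbf{Z} \in \mathbb{R}^{m \times k}$.

Algorithm 3: Stochastic Feature Extraction for k-means Clustering.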
5 Feature Extraction with Approximate SVD

Finally, we present a feature extraction algorithm that employs the SVD to construct $r = k$ artificial
features. Our method and proof techniques are the same as those of [7], with the only difference
being that we use a fast approximate (randomized) SVD via Lemma 2 as opposed to the
expensive exact deterministic SVD. In fact, replacing $\mathbf{Z} = \mathbf{V}_k$ reproduces the proof in [7]. Our
choice gives a considerably faster algorithm with approximation error comparable to the error in [7].
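The two steps of Algorithm 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the routine `approx_right_singular_basis` is a hypothetical stand-in for FastFrobeniusSVD (a generic Gaussian sketch followed by QR and a small SVD), it is not the construction of Lemma 2, and the sketch size `r` is an arbitrary choice, so none of the guarantees of Lemma 2 are claimed for it. What the sketch does preserve is the structure used in the proof below: $\mathbf{Z}$ has orthonormal columns and $\mathbf{E} = \mathbf{A} - \mathbf{A}\mathbf{Z}\mathbf{Z}^T$ satisfies $\mathbf{E}\mathbf{Z} = \mathbf{0}$.

```python
import numpy as np

def approx_right_singular_basis(A, k, eps, rng):
    """Hypothetical stand-in for FastFrobeniusSVD(A, k, eps).

    Returns Z (n x k) with orthonormal columns that approximately span
    the top-k right singular subspace of A.
    """
    m, n = A.shape
    r = min(m, n, k + int(np.ceil(k / eps)))   # sketch size: an assumption
    S = rng.standard_normal((r, m))            # Gaussian sketching matrix
    Q, _ = np.linalg.qr((S @ A).T)             # n x r orthonormal basis
    # Best rank-k directions inside span(Q), via a small SVD of A Q.
    _, _, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    return Q @ Wt[:k].T

def extract_features(A, k, eps, seed=0):
    """Step 1: compute Z. Step 2: return C = A Z (m x k), together with Z."""
    rng = np.random.default_rng(seed)
    Z = approx_right_singular_basis(A, k, eps, rng)
    return A @ Z, Z
```

Any k-means solver can then be run on the $m \times k$ matrix `C` instead of the $m \times n$ matrix `A`.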

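Theorem 13. Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $k$ be the inputs of the k-means clustering problem. Let $\epsilon \in (0, 1)$ and
construct features $\mathbf{C} \in \mathbb{R}^{m \times k}$ using Algorithm 3 in $O(mnk/\epsilon)$ time. Run any $\gamma$-approximation
k-means algorithm on $\mathbf{C}, k$ and construct $\mathbf{X}_{\tilde\gamma}$. Then, w.p. at least $0.99 - \delta_\gamma$,

$$\|\mathbf{A} - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T\mathbf{A}\|_F^2 \le (1 + (1 + \epsilon)\gamma)\, \|\mathbf{A} - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T\mathbf{A}\|_F^2.$$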

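Proof. We start by manipulating the term $\|\mathbf{A} - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T\mathbf{A}\|_F^2$. Notice that $\mathbf{A} = \mathbf{A}\mathbf{Z}\mathbf{Z}^T + \mathbf{E}$. Also,
$\left((\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{A}\mathbf{Z}\mathbf{Z}^T\right)\left((\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{E}\right)^T = \mathbf{0}_{m \times m}$, because $\mathbf{Z}^T\mathbf{E}^T = \mathbf{0}_{k \times m}$, by construction.
Now, using Matrix Pythagoras (see Lemma 1 in Section 2),

$$\|\mathbf{A} - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T\mathbf{A}\|_F^2 = \underbrace{\|(\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{A}\mathbf{Z}\mathbf{Z}^T\|_F^2}_{\theta_5^2} + \underbrace{\|(\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{E}\|_F^2}_{\theta_6^2}. \tag{13}$$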

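In the proof of Theorem 11 we argued that w.p. 0.99,

$$\theta_6^2 \le (1 + 100\epsilon)\, F_{\mathrm{opt}}.$$

We now bound the first term in Eqn. (13).

$$\begin{aligned}
\theta_5 &\le \|(\mathbf{I}_m - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T)\mathbf{A}\mathbf{Z}\|_F && (14)\\
&\le \sqrt{\gamma}\,\|(\mathbf{I}_m - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{A}\mathbf{Z}\|_F && (15)\\
&\le \sqrt{\gamma}\,\|(\mathbf{I}_m - \mathbf{X}_{\mathrm{opt}}\mathbf{X}_{\mathrm{opt}}^T)\mathbf{A}\|_F && (16)
\end{aligned}$$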

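In Eqn. (14), we used spectral submultiplicativity and the fact that $\|\mathbf{Z}^T\|_2 = 1$. In Eqn. (15), we
replaced $\mathbf{X}_{\tilde\gamma}$ by $\mathbf{X}_{\mathrm{opt}}$ and the factor $\sqrt{\gamma}$ appeared in the first term (similar argument as in the proof
of Theorem 11). In Eqn. (16), we used spectral submultiplicativity and the fact that $\|\mathbf{Z}\|_2 = 1$.
Overall (assuming $\gamma \ge 1$),

$$\|\mathbf{A} - \mathbf{X}_{\tilde\gamma}\mathbf{X}_{\tilde\gamma}^T\mathbf{A}\|_F^2 \le \theta_5^2 + \theta_6^2 \le \gamma F_{\mathrm{opt}} + (1 + 100\epsilon)F_{\mathrm{opt}} \le F_{\mathrm{opt}} + (1 + 102\epsilon)\gamma F_{\mathrm{opt}}.$$

The failure probability is $0.01 + \delta_\gamma$, from a union bound on Lemma 2 and the definition of a
$\gamma$-approximation algorithm. Finally, rescaling $\epsilon$ accordingly gives the approximation bound in the theorem. ■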
Appendix



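Proof of Lemma 7. Define the random variable $Y = \|\mathbf{Y}\boldsymbol{\Omega}\mathbf{S}\|_F^2 \ge 0$. Assume that the following
equation is true: $\mathbb{E}\|\mathbf{Y}\boldsymbol{\Omega}\mathbf{S}\|_F^2 = \|\mathbf{Y}\|_F^2$. Applying Markov's inequality with failure probability $\delta$
to this equation gives the bound in the lemma. All that remains is to prove the above
assumption. Let $\mathbf{B} = \mathbf{Y}\boldsymbol{\Omega}\mathbf{S} \in \mathbb{R}^{m \times r}$, and for $t = 1, \ldots, r$, let $\mathbf{B}^{(t)}$ denote the $t$-th column of
$\mathbf{B}$. We manipulate the term $\mathbb{E}\|\mathbf{Y}\boldsymbol{\Omega}\mathbf{S}\|_F^2$ as follows,

$$\mathbb{E}\|\mathbf{Y}\boldsymbol{\Omega}\mathbf{S}\|_F^2 \overset{(a)}{=} \mathbb{E}\sum_{t=1}^{r}\|\mathbf{B}^{(t)}\|_2^2 \overset{(b)}{=} \sum_{t=1}^{r}\mathbb{E}\|\mathbf{B}^{(t)}\|_2^2 \overset{(c)}{=} \sum_{t=1}^{r}\sum_{j=1}^{n} p_j\,\frac{\|\mathbf{Y}^{(j)}\|_2^2}{r\,p_j} \overset{(d)}{=} \frac{1}{r}\sum_{t=1}^{r}\|\mathbf{Y}\|_F^2 = \|\mathbf{Y}\|_F^2.$$

(a) follows by the definition of the Frobenius norm of $\mathbf{B}$. (b) follows by the linearity of expectation.
(c) follows by our construction of $\boldsymbol{\Omega}, \mathbf{S}$. (d) follows by the definition of the Frobenius norm of $\mathbf{Y}$.
It is worth noting that the above manipulations hold for any set of probabilities since they cancel
out in Equation (d). ■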
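The identity $\mathbb{E}\|\mathbf{Y}\boldsymbol{\Omega}\mathbf{S}\|_F^2 = \|\mathbf{Y}\|_F^2$ can be checked numerically with the following Python sketch. It assumes the usual construction of sampling with replacement: $r$ i.i.d. column indices are drawn with probabilities $p_1, \ldots, p_n$, and a sampled column $j$ is rescaled by $1/\sqrt{r p_j}$. The function names are hypothetical.

```python
import numpy as np

def sample_rescale(Y, p, r, rng):
    """Return Y @ Omega @ S: r columns of Y drawn i.i.d. with probabilities p,
    where a sampled column j is rescaled by 1 / sqrt(r * p[j])."""
    idx = rng.choice(Y.shape[1], size=r, replace=True, p=p)
    return Y[:, idx] / np.sqrt(r * p[idx])

def exact_expected_sq_norm(Y, p, r):
    """Exact E||Y Omega S||_F^2, following steps (a)-(d) of the proof.

    Each of the r trials contributes sum_j p_j * ||Y^(j)||^2 / (r p_j),
    so the probabilities cancel and the result is ||Y||_F^2.
    """
    col_sq = (Y ** 2).sum(axis=0)
    return r * np.sum(p * col_sq / (r * p))
```

The exact expectation equals $\|\mathbf{Y}\|_F^2$ for every strictly positive probability vector; the probabilities only affect the variance of a single draw.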

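Proof of Lemma 8. We begin with the analysis of a matrix-multiplication-type term involving the
multiplication of the matrices $\mathbf{E}, \mathbf{Z}$. The sampling and rescaling matrices $\boldsymbol{\Omega}, \mathbf{S}$ indicate the
subsampling of the columns and rows of $\mathbf{E}, \mathbf{Z}$, respectively. Eqn. (4) of Lemma 4 of [8] gives a bound
for such $\boldsymbol{\Omega}, \mathbf{S}$ constructed with randomized sampling with replacement and any set of probabilities
$p_1, p_2, \ldots, p_n$ (over the columns of $\mathbf{E}$ and the rows of $\mathbf{Z}$),

$$\mathbb{E}\,\|\mathbf{E}\mathbf{Z} - \mathbf{E}\boldsymbol{\Omega}\mathbf{S}\mathbf{S}^T\boldsymbol{\Omega}^T\mathbf{Z}\|_F^2 \le \sum_{i=1}^{n}\frac{\|\mathbf{E}^{(i)}\|_2^2\,\|\mathbf{Z}_{(i)}\|_2^2}{r\,p_i} - \frac{1}{r}\|\mathbf{E}\mathbf{Z}\|_F^2.$$

Notice that $\mathbf{E}\mathbf{Z} = \mathbf{0}_{m \times k}$, by construction (see Lemma 2). Now, for every $i = 1, \ldots, n$ replace the
values $p_i = \|\mathbf{Z}_{(i)}\|_2^2 / \|\mathbf{Z}\|_F^2$ (in Definition 5), use $\|\mathbf{Z}\|_F^2 = k$, and rearrange,

$$\mathbb{E}\,\|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}\mathbf{S}^T\boldsymbol{\Omega}^T\mathbf{Z}\|_F^2 \le \frac{k}{r}\,\|\mathbf{E}\|_F^2. \tag{17}$$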

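Observe that Lemma 6 and our choice of $r$ imply that w.p. $1 - \delta$,

$$1 - \epsilon \le \sigma_i^2(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S}) \le 1 + \epsilon, \quad \text{for all } i = 1, \ldots, k. \tag{18}$$

For what follows, condition on the event of Ineq. (18). First, $\sigma_k(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S}) > 0$. So, $\mathrm{rank}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S}) = k$
and $(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+} = \mathbf{I}_k$. (To see this, let $\mathbf{B} = \mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S} \in \mathbb{R}^{k \times r}$ with SVD $\mathbf{B} = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^T$.
Here, $\mathbf{U}_B \in \mathbb{R}^{k \times k}$, $\boldsymbol{\Sigma}_B \in \mathbb{R}^{k \times k}$, and $\mathbf{V}_B \in \mathbb{R}^{r \times k}$, since $r > k$. Finally,
$(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+} = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^T\mathbf{V}_B\boldsymbol{\Sigma}_B^{-1}\mathbf{U}_B^T = \mathbf{U}_B\boldsymbol{\Sigma}_B\boldsymbol{\Sigma}_B^{-1}\mathbf{U}_B^T = \mathbf{I}_k$.)
Now, $\mathbf{A}\mathbf{Z}\mathbf{Z}^T - \mathbf{A}\mathbf{Z}\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T = \mathbf{A}\mathbf{Z}\mathbf{Z}^T - \mathbf{A}\mathbf{Z}\mathbf{I}_k\mathbf{Z}^T = \mathbf{0}_{m \times n}$.
Next, we manipulate the term $\theta = \|\mathbf{A}\mathbf{Z}\mathbf{Z}^T - \mathbf{A}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T\|_F$ as follows (recall, $\mathbf{A} = \mathbf{A}\mathbf{Z}\mathbf{Z}^T + \mathbf{E}$),

$$\theta = \|\underbrace{\mathbf{A}\mathbf{Z}\mathbf{Z}^T - \mathbf{A}\mathbf{Z}\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T}_{\mathbf{0}_{m \times n}} - \mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T\|_F = \|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T\|_F.$$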
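Finally, we manipulate the latter term as follows,

$$\begin{aligned}
\|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\mathbf{Z}^T\|_F &\le \|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+}\|_F\\
&\le \|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{T}\|_F + \|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}\|_F\,\|(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+} - (\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{T}\|_2\\
&\le \sqrt{\frac{k}{\delta r}}\,\|\mathbf{E}\|_F + \frac{\epsilon}{\sqrt{\delta}\sqrt{1-\epsilon}}\,\|\mathbf{E}\|_F\\
&\le \left(\frac{\epsilon}{2\sqrt{\delta}\sqrt{\ln(2k/\delta)}} + \frac{\epsilon}{\sqrt{\delta}\sqrt{1-\epsilon}}\right)\|\mathbf{E}\|_F.
\end{aligned}$$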

The first inequality follows by spectral submultiplicativity and the fact that $\|\mathbf{Z}^T\|_2 = 1$. The second
inequality follows by the triangle inequality for matrix norms. In the third inequality, the bound for
the term $\|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{T}\|_F$ follows by applying to it Markov's inequality together with Ineq. (17);
also, $\|\mathbf{E}\boldsymbol{\Omega}\mathbf{S}\|_F$ is bounded in terms of $\|\mathbf{E}\|_F$ w.p. $1 - \delta$ (Lemma 7), while we bound $\|(\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{+} - (\mathbf{Z}^T\boldsymbol{\Omega}\mathbf{S})^{T}\|_2$
using Lemma 16 (set $\mathbf{Q} = \mathbf{Z}$ and $\boldsymbol{\Theta} = \boldsymbol{\Omega}\mathbf{S}$; we actually use the bound $\epsilon/\sqrt{1 - \epsilon}$ which can be found
in the proof of the lemma). So, by the union bound, the failure probability is $3\delta$. The rest of the
argument follows by our choice of $r$, assuming $k \ge 2$, $\epsilon < 1/3$ and simple algebraic manipulations.
■

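Proof of Lemma 9. First, define the random variable $Y = \|\mathbf{Y}\mathbf{R}\|_F^2$. It is easy to see that $\mathbb{E}Y =
\|\mathbf{Y}\|_F^2$ and moreover an upper bound for the variance of $Y$ is available in Lemma 8 of [23]: $\mathrm{Var}[Y] \le
2\|\mathbf{Y}\|_F^4/r$. Now, Chebyshev's inequality tells us that,

$$\mathbb{P}\left(|Y - \mathbb{E}Y| \ge \epsilon\|\mathbf{Y}\|_F^2\right) \le \frac{\mathrm{Var}[Y]}{\epsilon^2\|\mathbf{Y}\|_F^4} \le \frac{2\|\mathbf{Y}\|_F^4}{r\epsilon^2\|\mathbf{Y}\|_F^4} \le \frac{2}{c_0 k} \le 0.01.$$

The last inequality follows by assuming $c_0 \ge 100$ and the fact that $k > 1$. Finally, taking square
root on both sides concludes the proof. ■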
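As an illustration of the two facts used in this proof, the following Python sketch draws rescaled random sign matrices and compares the empirical mean and variance of $\|\mathbf{Y}\mathbf{R}\|_F^2$ against $\|\mathbf{Y}\|_F^2$ and $2\|\mathbf{Y}\|_F^4/r$. It assumes fully independent entries in $\{+1/\sqrt{r}, -1/\sqrt{r}\}$, each with probability 1/2; the function names are hypothetical, and a simulation of this kind is a sanity check rather than a proof.

```python
import numpy as np

def rescaled_sign_matrix(n, r, rng):
    """n x r matrix with i.i.d. entries equal to +1/sqrt(r) or -1/sqrt(r)."""
    return rng.choice([-1.0, 1.0], size=(n, r)) / np.sqrt(r)

def projected_sq_norms(Y, r, trials, rng):
    """Samples of ||Y R||_F^2 over independent draws of R."""
    n = Y.shape[1]
    return np.array([
        np.linalg.norm(Y @ rescaled_sign_matrix(n, r, rng), "fro") ** 2
        for _ in range(trials)
    ])
```

For example, with a fixed matrix `Y` and `r = 50`, the sample mean of `projected_sq_norms(Y, 50, 2000, rng)` should be close to `np.linalg.norm(Y, "fro") ** 2`, and the sample variance should not exceed the bound $2\|\mathbf{Y}\|_F^4/r$ by more than sampling noise.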

Proof of Lemma 10. We start with the definition of the Johnson-Lindenstrauss transform.

Definition 14 (Johnson-Lindenstrauss Transform). A random matrix $\mathbf{R} \in \mathbb{R}^{n \times r}$ forms a
Johnson-Lindenstrauss transform if, for any (row) vector $x \in \mathbb{R}^n$,

$$\mathbb{P}\left((1 - \epsilon)\|x\|_2^2 \le \|x\mathbf{R}\|_2^2 \le (1 + \epsilon)\|x\|_2^2\right) \ge 1 - e^{-C\epsilon^2 r},$$

where $C > 0$ is an absolute constant.

Notice that in order to achieve failure probability at most $\delta$, it suffices to take $r = O(\log(1/\delta)/\epsilon^2)$.
We continue with Theorem 1.1 of [1] (properly stated to fit our notation and after minor
algebraic manipulations), which indicates that a (rescaled) sign matrix $\mathbf{R}$ corresponds to a
Johnson-Lindenstrauss transform as defined above.



Footnote to the variance bound in the proof of Lemma 9: [24] assumes that the matrix $\mathbf{R}$ has i.i.d. rows, each one containing four-wise independent zero-mean
$\{1/\sqrt{r}, -1/\sqrt{r}\}$ entries. The claim in our lemma follows because our rescaled sign matrix $\mathbf{R}$ satisfies the four-wise
independence assumption, by construction.
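Theorem 15 ([1]). Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $0 < \epsilon < 1$. Let $\mathbf{R} \in \mathbb{R}^{n \times r}$ be a rescaled random sign
matrix with $r = \frac{36}{\epsilon^2}\log(m)\log(1/\delta)$. Then for all $i, j = 1, \ldots, m$ and w.p. at least $1 - \delta$,

$$(1 - \epsilon)\|\mathbf{A}_{(i)} - \mathbf{A}_{(j)}\|_2^2 \le \|(\mathbf{A}_{(i)} - \mathbf{A}_{(j)})\mathbf{R}\|_2^2 \le (1 + \epsilon)\|\mathbf{A}_{(i)} - \mathbf{A}_{(j)}\|_2^2.$$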

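In addition, we will use a matrix multiplication bound which follows from Lemma 6 of [23]. The
second claim of this lemma says that for any $\mathbf{X} \in \mathbb{R}^{m \times n}$ and $\mathbf{Y} \in \mathbb{R}^{n \times p}$, if $\mathbf{R} \in \mathbb{R}^{n \times r}$ is a matrix
with i.i.d. rows, each one containing four-wise independent zero-mean $\{1/\sqrt{r}, -1/\sqrt{r}\}$ entries, then,

$$\mathbb{E}\,\|\mathbf{X}\mathbf{Y} - \mathbf{X}\mathbf{R}\mathbf{R}^T\mathbf{Y}\|_F^2 \le \frac{2}{r}\,\|\mathbf{X}\|_F^2\,\|\mathbf{Y}\|_F^2. \tag{19}$$

Our random matrix $\mathbf{R}$ uses full independence, hence the above bound holds by dropping the limited
independence condition.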

Statement 1. The first statement in our lemma has been proved in Corollary 11 of [23]; see also [6,
Theorem 1.3] for a restatement. More precisely, repeat the proof of Corollary 11 of [23] paying
attention to the constants. That is, set $\mathbf{C} = \mathbf{V}_k^T\mathbf{R}\mathbf{R}^T\mathbf{V}_k - \mathbf{I}_k$ and $\epsilon_0 = 1/2$ in Lemma 10 of [23], and
apply our JL transform with (rescaled) accuracy $\epsilon/4$ on each vector of the set $T' := \{\mathbf{V}_k x \mid x \in T\}$
(which is of size at most $e^{k\ln(18)}$). So,

$$\mathbb{P}\left(\forall i = 1, \ldots, k : 1 - \epsilon \le \sigma_i^2(\mathbf{V}_k^T\mathbf{R}) \le 1 + \epsilon\right) \ge 1 - e^{k\ln(18)}\, e^{-\epsilon^2 r/(36 \cdot 16)}. \tag{20}$$

Setting $r$ such that the failure probability is at most 0.01 indicates that $r$ should be at least
$r \ge 576(k\ln(18) + \ln(100))/\epsilon^2$. So, $c_0 = 3330$ is a sufficiently large constant for the lemma.
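The arithmetic behind the constant can be spelled out with a short Python sketch. It assumes the failure probability in Ineq. (20) has the form $e^{k\ln(18)}e^{-\epsilon^2 r/576}$ (with $576 = 36 \cdot 16$) and that $r = c_0 k/\epsilon^2$; the function names are hypothetical.

```python
import math

def failure_probability_bound(k, eps, r):
    """Bound exp(k ln 18) * exp(-eps^2 r / 576) on the failure probability."""
    return math.exp(k * math.log(18) - eps ** 2 * r / 576.0)

def required_r(k, eps, failure=0.01):
    """Smallest r for which the bound above is at most `failure`."""
    return 576.0 * (k * math.log(18) + math.log(1.0 / failure)) / eps ** 2

def r_from_constant(k, eps, c0=3330):
    """The choice r = c0 * k / eps^2 used in the lemma."""
    return c0 * k / eps ** 2
```

Since $576\ln(18) \approx 1664.9$ and $576\ln(100) \approx 2652.6$, the requirement reads $r\epsilon^2 \ge 1664.9\,k + 2652.6$, which is at most $3330\,k$ for every $k \ge 2$; this matches the standing assumption $k > 1$ used elsewhere in the appendix.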

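Statement 2. Consider the following three events (w.r.t. the randomness of the random matrix
$\mathbf{R}$): $\mathcal{E}_1 := \{1 - \epsilon \le \sigma_i^2(\mathbf{V}_k^T\mathbf{R}) \le 1 + \epsilon\}$, $\mathcal{E}_2 := \{\|\mathbf{A}_{\rho-k}\mathbf{R}\|_F^2 \le (1 + \epsilon)\|\mathbf{A}_{\rho-k}\|_F^2\}$ and $\mathcal{E}_3 :=
\{\|\mathbf{A}_{\rho-k}\mathbf{R}\mathbf{R}^T\mathbf{V}_k\|_F^2 \le \epsilon^2\|\mathbf{A}_{\rho-k}\|_F^2\}$. Ineq. (20) and Lemma 9 with $\mathbf{Y} = \mathbf{A}_{\rho-k}$ imply that $\mathbb{P}(\mathcal{E}_1) \ge
0.99$, $\mathbb{P}(\mathcal{E}_2) \ge 0.99$, respectively. A crucial observation for bounding the failure probability of the
last event $\mathcal{E}_3$ is that $\mathbf{A}_{\rho-k}\mathbf{V}_k = \mathbf{U}_{\rho-k}\boldsymbol{\Sigma}_{\rho-k}\mathbf{V}_{\rho-k}^T\mathbf{V}_k = \mathbf{0}_{m \times k}$ by orthogonality of the columns of
$\mathbf{V}_k$ and $\mathbf{V}_{\rho-k}$. This event can now be bounded by applying Markov's inequality on Ineq. (19)
with $\mathbf{X} = \mathbf{A}_{\rho-k}$ and $\mathbf{Y} = \mathbf{V}_k$ and recalling that $\|\mathbf{V}_k\|_F^2 = k$ and $r = c_0 k/\epsilon^2$. Assuming $c_0 \ge 200$,
it follows that $\mathbb{P}(\mathcal{E}_3) \ge 0.99$ (hence, setting $c_0 = 3330$ is a sufficiently large constant for both
statements). A union bound implies that these three events happen w.p. 0.97. For what follows,
condition on these three events.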

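Let $\mathbf{E} = \mathbf{A}_k - (\mathbf{A}\mathbf{R})(\mathbf{V}_k^T\mathbf{R})^{+}\mathbf{V}_k^T \in \mathbb{R}^{m \times n}$. By setting $\mathbf{A} = \mathbf{A}_k + \mathbf{A}_{\rho-k}$ and using the triangle
inequality,

$$\|\mathbf{E}\|_F \le \|\mathbf{A}_k - \mathbf{A}_k\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{+}\mathbf{V}_k^T\|_F + \|\mathbf{A}_{\rho-k}\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{+}\mathbf{V}_k^T\|_F.$$

The event $\mathcal{E}_1$ implies that $\mathrm{rank}(\mathbf{V}_k^T\mathbf{R}) = k$; thus,

$$(\mathbf{V}_k^T\mathbf{R})(\mathbf{V}_k^T\mathbf{R})^{+} = \mathbf{I}_k.$$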

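Footnote to Theorem 15: This theorem is proved by first showing that a rescaled random sign matrix is a
Johnson-Lindenstrauss transform [1, Lemma 5.1] with constant $C = 36$. Then, setting an appropriate value for $r$ and
applying the union bound over all pairs of row indices of $\mathbf{A}$ concludes the proof.

Footnote to the identity $(\mathbf{V}_k^T\mathbf{R})(\mathbf{V}_k^T\mathbf{R})^{+} = \mathbf{I}_k$: To see this, let $\mathbf{B} = \mathbf{V}_k^T\mathbf{R} \in \mathbb{R}^{k \times r}$ with SVD $\mathbf{B} = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^T$.
Here, $\mathbf{U}_B \in \mathbb{R}^{k \times k}$, $\boldsymbol{\Sigma}_B \in \mathbb{R}^{k \times k}$, and $\mathbf{V}_B \in \mathbb{R}^{r \times k}$, since $r > k$. Finally,
$(\mathbf{V}_k^T\mathbf{R})(\mathbf{V}_k^T\mathbf{R})^{+} = \mathbf{U}_B\boldsymbol{\Sigma}_B\underbrace{\mathbf{V}_B^T\mathbf{V}_B}_{\mathbf{I}_k}\boldsymbol{\Sigma}_B^{-1}\mathbf{U}_B^T = \mathbf{U}_B\underbrace{\boldsymbol{\Sigma}_B\boldsymbol{\Sigma}_B^{-1}}_{\mathbf{I}_k}\mathbf{U}_B^T = \mathbf{I}_k$.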
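Replacing $\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T$ and setting $(\mathbf{V}_k^T\mathbf{R})(\mathbf{V}_k^T\mathbf{R})^{+} = \mathbf{I}_k$, we obtain that

$$\|\mathbf{A}_k - \mathbf{A}_k\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{+}\mathbf{V}_k^T\|_F = \|\mathbf{A}_k - \mathbf{U}_k\boldsymbol{\Sigma}_k\underbrace{\mathbf{V}_k^T\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{+}}_{\mathbf{I}_k}\mathbf{V}_k^T\|_F = \|\mathbf{A}_k - \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\|_F = 0.$$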

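To bound the second term above, we drop $\mathbf{V}_k^T$, add and subtract $\mathbf{A}_{\rho-k}\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{T}$, and use the
triangle inequality and spectral sub-multiplicativity,

$$\begin{aligned}
\|\mathbf{A}_{\rho-k}\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{+}\mathbf{V}_k^T\|_F &\le \|\mathbf{A}_{\rho-k}\mathbf{R}(\mathbf{V}_k^T\mathbf{R})^{T}\|_F + \|\mathbf{A}_{\rho-k}\mathbf{R}\left((\mathbf{V}_k^T\mathbf{R})^{+} - (\mathbf{V}_k^T\mathbf{R})^{T}\right)\|_F\\
&\le \|\mathbf{A}_{\rho-k}\mathbf{R}\mathbf{R}^T\mathbf{V}_k\|_F + \|\mathbf{A}_{\rho-k}\mathbf{R}\|_F\,\|(\mathbf{V}_k^T\mathbf{R})^{+} - (\mathbf{V}_k^T\mathbf{R})^{T}\|_2.
\end{aligned}$$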

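Now, we will bound each term individually. We bound the first term using $\mathcal{E}_3$. The second term
can be bounded using $\mathcal{E}_1$ and $\mathcal{E}_2$ together with Lemma 16 (set $\mathbf{Q} = \mathbf{V}_k$ and $\boldsymbol{\Theta} = \mathbf{R}$). Hence,

$$\begin{aligned}
\|\mathbf{E}\|_F &\le \|\mathbf{A}_{\rho-k}\mathbf{R}\mathbf{R}^T\mathbf{V}_k\|_F + \|\mathbf{A}_{\rho-k}\mathbf{R}\|_F\,\|(\mathbf{V}_k^T\mathbf{R})^{+} - (\mathbf{V}_k^T\mathbf{R})^{T}\|_2\\
&\le \epsilon\,\|\mathbf{A}_{\rho-k}\|_F + \sqrt{1+\epsilon}\,\|\mathbf{A}_{\rho-k}\|_F \cdot 1.5\epsilon\\
&\le \epsilon\,\|\mathbf{A}_{\rho-k}\|_F + 2\epsilon\,\|\mathbf{A}_{\rho-k}\|_F\\
&= 3\epsilon\,\|\mathbf{A}_{\rho-k}\|_F.
\end{aligned}$$

The last inequality holds by our choice of $\epsilon \in (0, 1/3)$. ■

Finally, the following technical lemma is useful in Lemma 8 and Lemma 10.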

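Lemma 16. Let $\mathbf{Q} \in \mathbb{R}^{n \times k}$ with $n > k$ and $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}_k$. Let $\boldsymbol{\Theta}$ be any $n \times r$ matrix ($r > k$)
satisfying $1 - \epsilon \le \sigma_i^2(\mathbf{Q}^T\boldsymbol{\Theta}) \le 1 + \epsilon$ for every $i = 1, \ldots, k$ and $0 < \epsilon < 1/3$. Then,

$$\|(\mathbf{Q}^T\boldsymbol{\Theta})^{+} - (\mathbf{Q}^T\boldsymbol{\Theta})^{T}\|_2 \le 1.5\epsilon.$$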

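Proof. Let $\mathbf{X} = \mathbf{Q}^T\boldsymbol{\Theta} \in \mathbb{R}^{k \times r}$ with SVD $\mathbf{X} = \mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^T$. Here, $\mathbf{U}_X \in \mathbb{R}^{k \times k}$, $\boldsymbol{\Sigma}_X \in \mathbb{R}^{k \times k}$, and
$\mathbf{V}_X \in \mathbb{R}^{r \times k}$, since $r > k$. Consider taking the SVD of $(\mathbf{Q}^T\boldsymbol{\Theta})^{+}$ and $(\mathbf{Q}^T\boldsymbol{\Theta})^{T}$,

$$\|(\mathbf{Q}^T\boldsymbol{\Theta})^{+} - (\mathbf{Q}^T\boldsymbol{\Theta})^{T}\|_2 = \|\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1}\mathbf{U}_X^T - \mathbf{V}_X\boldsymbol{\Sigma}_X\mathbf{U}_X^T\|_2 = \|\mathbf{V}_X(\boldsymbol{\Sigma}_X^{-1} - \boldsymbol{\Sigma}_X)\mathbf{U}_X^T\|_2 = \|\boldsymbol{\Sigma}_X^{-1} - \boldsymbol{\Sigma}_X\|_2,$$

since $\mathbf{V}_X$ and $\mathbf{U}_X^T$ can be dropped without changing the spectral norm. Let $\mathbf{Y} = \boldsymbol{\Sigma}_X^{-1} - \boldsymbol{\Sigma}_X \in \mathbb{R}^{k \times k}$
be a diagonal matrix. Then, for all $i = 1, \ldots, k$, $\mathbf{Y}_{ii} = \frac{1 - \sigma_i^2(\mathbf{X})}{\sigma_i(\mathbf{X})}$. Since $\mathbf{Y}$ is diagonal,

$$\|\mathbf{Y}\|_2 = \max_{1 \le i \le k}|\mathbf{Y}_{ii}| = \max_{1 \le i \le k}\frac{|1 - \sigma_i^2(\mathbf{X})|}{\sigma_i(\mathbf{X})} \le \frac{\epsilon}{\sqrt{1 - \epsilon}} \le 1.5\epsilon.$$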



The first equality follows since the singular values are positive (from our choice of $\epsilon$ and the left
hand side of the bound for the singular values). The first inequality follows by the bound for the
singular values of $\mathbf{X}$. The last inequality follows by the assumption that $0 < \epsilon < 1/3$. ■
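The key identity in this proof, namely that $\|(\mathbf{Q}^T\boldsymbol{\Theta})^{+} - (\mathbf{Q}^T\boldsymbol{\Theta})^{T}\|_2$ equals $\max_i |\sigma_i^{-1}(\mathbf{X}) - \sigma_i(\mathbf{X})|$, can be checked numerically with the following Python sketch. It assumes that $\mathbf{Q}^T\boldsymbol{\Theta}$ has full row rank, so that the pseudoinverse has the form $\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1}\mathbf{U}_X^T$; the function names are hypothetical.

```python
import numpy as np

def pinv_transpose_gap(Q, Theta):
    """Spectral norm of (Q^T Theta)^+ - (Q^T Theta)^T, computed directly."""
    X = Q.T @ Theta
    return np.linalg.norm(np.linalg.pinv(X) - X.T, 2)

def singular_value_gap(Q, Theta):
    """max_i |1/sigma_i - sigma_i| over the singular values of Q^T Theta.

    Assumes Q^T Theta has full row rank (all k singular values positive).
    """
    s = np.linalg.svd(Q.T @ Theta, compute_uv=False)
    return np.max(np.abs(1.0 / s - s))
```

If $\epsilon$ denotes the observed value of $\max_i |\sigma_i^2(\mathbf{X}) - 1|$ and $\epsilon < 1$, both quantities are at most $\epsilon/\sqrt{1 - \epsilon}$, which is the intermediate bound used in the proofs of Lemma 8 and Lemma 10.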